1 Example hpgltool usage with a real data set (fission)

This document aims to provide further examples in how to use the hpgltools.

Note to self, the header has rmarkdown::pdf_document instead of html_document or html_vignette because it gets some bullcrap error ‘margins too large’…

1.1 Setting up

Here are the commands I invoke to get ready to play with new data, including everything required to install hpgltools, the software it uses, and the fission data.

if (! "BiocManager" %in% installed.packages()) {
  source("http://bioconductor.org/biocLite.R")
}
if (! "devtools" %in% installed.packages()) {
  biocManager::install("devtools")
}
if (! "hpgltools" %in% installed.packages()) {
  devtools::install_github("elsayed-lab/hpgltools")
}
if (! "fission" %in% installed.packages()) {
  biocManager::install("fission")
}
## I use the function 'sm()' to quiet loud functions.
tt <- sm(library(fission))
tt <- sm(data(fission))

1.2 Data import

All the work I do in Dr. El-Sayed’s lab makes some pretty hard assumptions about how data is stored. As a result, to use the fission data set I will do a little bit of shenanigans to match it to the expected format. Now that I have played a little with fission, I think its format is quite nice and am likely to have my experiment class instead be a SummarizedExperiment.

## Extract the meta data from the fission dataset
meta <- as.data.frame(fission@colData)
## Make conditions and batches
meta$condition <- paste(meta$strain, meta$minute, sep=".")
meta$batch <- meta$replicate
meta$sample.id <- rownames(meta)
## Grab the count data
fission_data <- fission@assays$data$counts
## This will make an experiment superclass called 'expt' and it contains
## an ExpressionSet along with any arbitrary additional information one might want to include.
## Along the way it writes a Rdata file which is by default called 'expt.Rdata'
fission_expt <- create_expt(metadata=meta, count_dataframe=fission_data)
## Reading the sample metadata.
## The sample definitions comprises: 36 rows(samples) and 7 columns(metadata fields).
## Matched 7039 annotations and counts.
## Bringing together the count matrix and gene information.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 7039 rows and 36 columns.

1.3 Normalizing and exploring data

There are lots of toys we have learned to use to play with with raw data and explore stuff like batch effects or non-canonical distributions or skewed counts. hpgltools provides some functionality to make this process easier. The graphs shown below and many more are generated with the wrapper ‘graph_metrics()’ but that takes away the chance to explain the graphs as I generate them.

## First make a bar plot of the library sizes in the experiment.
## Notice that the colors were auto-chosen by create_expt() and they should
## be maintained throughout this process
fis_libsize <- plot_libsize(fission_expt)
fis_libsize$plot

## Here we see that the wild type replicate 3 sample for 15 minutes has fewer non-zero genes than all its friends.
fis_nonzero <- plot_nonzero(fission_expt, labels="boring", title="nonzero vs. cpm")
fis_nonzero$plot

fis_density <- plot_density(fission_expt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, setting them to 0.5.
## Changed 24130 zero count features.
fis_density$plot

knitr::kable(fis_density$condition_summary)
condition min 1st median mean 3rd max
wt.0 0.5 31 216 1839 647 5346309
wt.15 0.5 20 135 1819 474 6733064
wt.30 0.5 27 183 1917 559 6722419
wt.60 0.5 27 195 1676 627 5034452
wt.120 0.5 25 183 1540 529 4531663
wt.180 0.5 29 210 1904 600 7223993
mut.0 0.5 35 238 1826 716 4818286
mut.15 0.5 22 142 1582 461 4387929
mut.30 0.5 33 226 2028 684 6061779
mut.60 0.5 39 279 1981 772 5268796
mut.120 0.5 35 242 1763 675 4174104
mut.180 0.5 24 178 1534 510 4276682
knitr::kable(fis_density$batch_summary)
batch min 1st median mean 3rd max
r1 0.5 30 205 1908 632 6733064
r2 0.5 28 193 1727 586 7223993
r3 0.5 27 196 1717 600 6722419
knitr::kable(head(fis_density$sample_summary))
sample min 1st median mean 3rd max
GSM1368273 0.5 46 306 2226 869.0 4542884
GSM1368274 0.5 21 147 1345 411.0 4117160
GSM1368275 0.5 34 243 1946 678.0 5346309
GSM1368276 0.5 34 226 2625 751.5 6733064
GSM1368277 0.5 20 123 1471 393.0 4094540
GSM1368278 0.5 13 95 1360 312.0 4714248

1.3.1 An initial pca plot

In most cases, raw data does not cluster very well, lets see if that is also true for the fission experiment. Assuming it doesn’t, lets normalize the data using the defaults (cpm, quantile, log2) and try again.

## Something in this is causing a build loop on travis...
## Unsurprisingly, the raw data doesn't cluster well at all...
##fis_rawpca <- plot_pca(fission_expt, expt_names=fission_expt$condition, cis=NULL)
fis_rawpca <- plot_pca(fission_expt)
fis_rawpca$plot

## So, normalize the data
norm_expt <- sm(normalize_expt(fission_expt, transform="log2", norm="quant", convert="cpm"))
## And try the pca again
fis_normpca <- plot_pca(norm_expt, plot_labels="normal", title="normalized pca", cis=NULL)
fis_normpca$plot

## Clearly time is an important factor in the data.

1.4 Try something for Najib

testing <- plot_pca(norm_expt, num_pc=3)
silly <- plot_3d_pca(testing)
silly$plot

In the final line of the preceeding block, I printed a summary of the return from plot_pca(). It contains the following information:

  • pca: The result from the fast.svd() call.
  • plot: The ggplot2 pca plot.
  • table: The metadata used to make the pca plot.
  • res: A table of the residual variance after each component by condition/batch.
  • variance: A numeric list of the %variance remaining after each PC.

With that in mind, lets perform some more pca plots after normalizing the data and see how different they look.

normbatch_expt <- sm(normalize_expt(fission_expt, transform="log2", norm="quant",
                                    convert="cpm", batch="sva"))
fis_normbatchpca <- plot_pca(normbatch_expt,
                             title="Normalized PCA with batch effect correction.", cis=NULL)
fis_normbatchpca$plot

## ok, that caused the 0, 60, 15, and 30 minute samples to cluster nicely
## the 120 and 180 minute samples are still a bit tight

## pca_information provides some more information about the call to
## fast.svd that went into making the pca plot
fis_info <- pca_information(norm_expt,
                            expt_factors=c("condition","batch","strain","minute"),
                            num_components=6)
## More shallow curves in these plots suggest more genes in this principle component.

## The r^2 table shows that quite a lot of the variance in the data is explained by condition
knitr::kable(head(fis_info$rsquared_table))
prop_var cum_prop_var condition_rsquared batch_rsquared
36.64 36.64 0.9914 0.0002
14.01 50.65 0.9667 0.0038
8.43 59.08 0.6640 0.2510
5.81 64.89 0.3872 0.2659
5.18 70.07 0.8888 0.0514
3.63 73.70 0.1516 0.0095
## We can look at the correlation between the principle components and the factors in the experiment
## in this case looking at condition/batch vs the first 4 components.
knitr::kable(fis_info$pca_cor)
PC1 PC2 PC3 PC4 PC5 PC6
condition 0.3160 -0.2694 -0.0582 -0.3246 0.0374 0.2853
batch 0.0123 -0.0606 0.5008 -0.5105 -0.2260 0.0969
strain 0.0351 -0.0077 0.1354 -0.2467 0.0562 0.2782
minute 0.5772 -0.5310 -0.3556 -0.2228 -0.0230 0.0881
## And p-values to lend some credence(or not to those assertions)
knitr::kable(fis_info$anova_p)
PC1 PC2 PC3 PC4 PC5 PC6
condition 0.0604 0.1121 0.7358 0.0534 0.8284 0.0916
batch 0.9431 0.7255 0.0019 0.0015 0.1850 0.5739
strain 0.8391 0.9645 0.4310 0.1469 0.7449 0.1004
minute 0.0002 0.0009 0.0333 0.1915 0.8940 0.6094
## Try again with batch removed data
batchnorm_expt <- sm(normalize_expt(fission_expt, batch="limma", norm="quant",
                                    transform="log2", convert="cpm"))
fis_batchnormpca <- plot_pca(batchnorm_expt, plot_title="limma corrected pca")
fis_batchnormpca$plot

test_pca <- pca_information(batchnorm_expt,
                            expt_factors=c("condition","batch","strain","minute"),
                            num_components=6)
## More shallow curves in these plots suggest more genes in this principle component.

Interesting, the batch normalized pca plot looks much the same as the normalized. The variances are in fact pretty much the exact same…

1.5 Look at the data distributions

We have some tools which provide visualizations of the distribution of the data:

fission_boxplot <- sm(plot_boxplot(fission_expt))
fission_boxplot

sf_expt <- sm(normalize_expt(fission_expt, norm="sf"))
fission_boxplot <- sm(plot_boxplot(sf_expt))
fission_boxplot

tm_expt <- sm(normalize_expt(fission_expt, norm="tmm"))
fission_boxplot <- sm(plot_boxplot(tm_expt))
fission_boxplot

rle_expt <- sm(normalize_expt(fission_expt, norm="rle"))
fission_boxplot <- sm(plot_boxplot(rle_expt))
fission_boxplot

up_expt <- sm(normalize_expt(fission_expt, norm="upperquartile"))
fission_boxplot <- sm(plot_boxplot(up_expt))
fission_boxplot

fission_density <- plot_density(norm_expt)
fission_density$plot

fission_density <- plot_density(sf_expt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, setting them to 0.5.
## Changed 24130 zero count features.
fission_density$plot

fission_density <- plot_density(tm_expt)
## This data will benefit from being displayed on the log scale.
## If this is not desired, set scale='raw'
## Some entries are 0.  We are on log scale, setting them to 0.5.
## Changed 24130 zero count features.
fission_density$plot

compare_12 <- plot_single_qq(fission_expt, x=1, y=2)
compare_12$log

1.6 See how they cluster

Ok, so we can further check out how the data cluster with respect to one another…

fission_cor <- plot_corheat(norm_expt)

fission_cor$plot
fission_cor <- plot_corheat(batchnorm_expt)

fission_cor$plot
fission_dis <- plot_disheat(norm_expt)

fission_dis$plot
fission_dis <- plot_disheat(batchnorm_expt)

fission_dis$plot

2 variancePartition

variancePartition may be used to seek out which experimental factors correlate with the most variance in the data.

test_varpart <- simple_varpart(fission_expt, predictor=NULL, factors=c("condition","batch"))
## Attempting mixed linear model with: ~  (1|condition) + (1|batch)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading

## Warning in serialize(data, node$con, xdr = FALSE): 'package:variancePartition'
## may not be available when loading
## 
## Total:22 s
## Placing factor: condition at the beginning of the model.
test_varpart$percent_plot

test_varpart$partition_plot

## Here, let us test the variance contributed by strain, time, and replicate.
test_varpart <- simple_varpart(fission_expt, predictor=NULL, factors=c("condition", "strain", "minute", "replicate"))
## Attempting mixed linear model with: ~  (1|condition) + (1|strain) + (1|minute) + (1|replicate)
## Fitting the expressionset to the model, this is slow.
## Dividing work into 100 chunks...
## 
## Total:27 s
## Placing factor: condition at the beginning of the model.
test_varpart$percent_plot

test_varpart$partition_plot

YAY!

pander::pander(sessionInfo())

R version 3.6.1 (2019-07-05)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats4, parallel, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: variancePartition(v.1.16.1), fission(v.1.6.0), ruv(v.0.9.7.1), SummarizedExperiment(v.1.16.1), DelayedArray(v.0.12.2), BiocParallel(v.1.20.1), matrixStats(v.0.55.0), GenomicRanges(v.1.38.0), GenomeInfoDb(v.1.22.0), IRanges(v.2.20.2), S4Vectors(v.0.24.3), hpgltools(v.1.0), Biobase(v.2.46.0) and BiocGenerics(v.0.32.0)

loaded via a namespace (and not attached): R.utils(v.2.9.2), tidyselect(v.1.0.0), lme4(v.1.1-21), RSQLite(v.2.2.0), AnnotationDbi(v.1.48.0), htmlwidgets(v.1.5.1), grid(v.3.6.1), Rtsne(v.0.15), devtools(v.2.2.2), DESeq(v.1.38.0), munsell(v.0.5.0), codetools(v.0.2-16), preprocessCore(v.1.48.0), withr(v.2.1.2), colorspace(v.1.4-1), highr(v.0.8), knitr(v.1.28), rstudioapi(v.0.11), Vennerable(v.3.1.0.9000), robustbase(v.0.93-5), genoPlotR(v.0.8.9), labeling(v.0.3), GenomeInfoDbData(v.1.2.2), bit64(v.0.9-7), farver(v.2.0.3), rprojroot(v.1.3-2), vctrs(v.0.2.4), xfun(v.0.12), fastcluster(v.1.1.25), R6(v.2.4.1), doParallel(v.1.0.15), locfit(v.1.5-9.1), bitops(v.1.0-6), assertthat(v.0.2.1), promises(v.1.1.0), scales(v.1.1.0), nnet(v.7.3-12), gtable(v.0.3.0), sva(v.3.34.0), processx(v.3.4.2), rlang(v.0.4.5), genefilter(v.1.68.0), splines(v.3.6.1), rtracklayer(v.1.46.0), lazyeval(v.0.2.2), acepack(v.1.4.1), selectr(v.0.4-2), checkmate(v.2.0.0), yaml(v.2.2.1), reshape2(v.1.4.3), crosstalk(v.1.0.0), backports(v.1.1.5), httpuv(v.1.5.2), Hmisc(v.4.3-1), RBGL(v.1.62.1), tools(v.3.6.1), usethis(v.1.5.1), ggplot2(v.3.2.1), ellipsis(v.0.3.0), gplots(v.3.0.1.2), RColorBrewer(v.1.1-2), sessioninfo(v.1.1.1), Rcpp(v.1.0.3), plyr(v.1.8.5), progress(v.1.2.2), base64enc(v.0.1-3), zlibbioc(v.1.32.0), purrr(v.0.3.3), RCurl(v.1.98-1.1), ps(v.1.3.2), prettyunits(v.1.1.1), rpart(v.4.1-15), ggrepel(v.0.8.1), cluster(v.2.1.0), colorRamps(v.2.3), fs(v.1.3.1), magrittr(v.1.5), data.table(v.1.12.8), openxlsx(v.4.1.4), pkgload(v.1.0.2), hms(v.0.5.3), mime(v.0.9), evaluate(v.0.14), xtable(v.1.8-4), pbkrtest(v.0.4-7), XML(v.3.99-0.3), jpeg(v.0.1-8.1), gridExtra(v.2.3), testthat(v.2.3.1), compiler(v.3.6.1), tibble(v.2.1.3), KernSmooth(v.2.23-16), crayon(v.1.3.4), minqa(v.1.2.4), R.oo(v.1.23.0), htmltools(v.0.4.0), mgcv(v.1.8-31), corpcor(v.1.6.9), later(v.1.0.0), Formula(v.1.2-3), tidyr(v.1.0.2), geneplotter(v.1.64.0), DBI(v.1.1.0), MASS(v.7.3-51.5), boot(v.1.3-24), Matrix(v.1.2-18), ade4(v.1.7-15), readr(v.1.3.1), cli(v.2.0.1), quadprog(v.1.5-8), R.methodsS3(v.1.8.0), gdata(v.2.18.0), pkgconfig(v.2.0.3), GenomicAlignments(v.1.22.1), foreign(v.0.8-75), plotly(v.4.9.2), xml2(v.1.2.2), foreach(v.1.4.8), annotate(v.1.64.0), XVector(v.0.26.0), rvest(v.0.3.5), stringr(v.1.4.0), callr(v.3.4.2), digest(v.0.6.25), graph(v.1.64.0), Biostrings(v.2.54.0), rmarkdown(v.2.1), htmlTable(v.1.13.3), edgeR(v.3.28.0), directlabels(v.2020.1.31), curl(v.4.3), shiny(v.1.4.0), Rsamtools(v.2.2.2), gtools(v.3.8.1), nloptr(v.1.2.1), lifecycle(v.0.1.0), nlme(v.3.1-144), jsonlite(v.1.6.1), desc(v.1.2.0), viridisLite(v.0.3.0), limma(v.3.42.2), fansi(v.0.4.1), pillar(v.1.4.3), lattice(v.0.20-40), fastmap(v.1.0.1), httr(v.1.4.1), DEoptimR(v.1.0-8), pkgbuild(v.1.0.6), survival(v.3.1-8), glue(v.1.3.1), remotes(v.2.1.1), zip(v.2.0.4), png(v.0.1-7), iterators(v.1.0.12), pander(v.0.6.3), bit(v.1.1-15.2), stringi(v.1.4.6), blob(v.1.2.1), DESeq2(v.1.26.0), latticeExtra(v.0.6-29), caTools(v.1.18.0), memoise(v.1.1.0) and dplyr(v.0.8.4)

---
title: "hpgltools examples using the fission dataset"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
    collapsed: false
    smooth_scroll: false
vignette: >
  %\VignetteIndexEntry{b-02_fission_data_exploration}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r options, include=FALSE}
## These are the options I tend to favor
library("hpgltools")
knitr::opts_knit$set(progress=TRUE,
                     verbose=TRUE,
                     width=90,
                     echo=TRUE)
knitr::opts_chunk$set(error=TRUE,
                      fig.width=8,
                      fig.height=8,
                      dpi=96)
old_options <- options(digits=4,
                       stringsAsFactors=FALSE,
                       knitr.duplicate.label="allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size=10))
set.seed(1)
rmd_file <- "b-02_fission_data_exploration.Rmd"
```

# Example hpgltool usage with a real data set (fission)

This document aims to provide further examples in how to use the hpgltools.

Note to self, the header has rmarkdown::pdf_document instead of html_document or html_vignette
because it gets some bullcrap error 'margins too large'...

## Setting up

Here are the commands I invoke to get ready to play with new data, including everything
required to install hpgltools, the software it uses, and the fission data.

```{r setup}
if (! "BiocManager" %in% installed.packages()) {
  source("http://bioconductor.org/biocLite.R")
}
if (! "devtools" %in% installed.packages()) {
  biocManager::install("devtools")
}
if (! "hpgltools" %in% installed.packages()) {
  devtools::install_github("elsayed-lab/hpgltools")
}
if (! "fission" %in% installed.packages()) {
  biocManager::install("fission")
}
## I use the function 'sm()' to quiet loud functions.
tt <- sm(library(fission))
tt <- sm(data(fission))
```

## Data import

All the work I do in Dr. El-Sayed's lab makes some pretty hard
assumptions about how data is stored.  As a result, to use the fission
data set I will do a little bit of shenanigans to match it to the
expected format.  Now that I have played a little with fission, I
think its format is quite nice and am likely to have my experiment
class instead be a SummarizedExperiment.

```{r data_import}
## Extract the meta data from the fission dataset
meta <- as.data.frame(fission@colData)
## Make conditions and batches
meta$condition <- paste(meta$strain, meta$minute, sep=".")
meta$batch <- meta$replicate
meta$sample.id <- rownames(meta)
## Grab the count data
fission_data <- fission@assays$data$counts
## This will make an experiment superclass called 'expt' and it contains
## an ExpressionSet along with any arbitrary additional information one might want to include.
## Along the way it writes a Rdata file which is by default called 'expt.Rdata'
fission_expt <- create_expt(metadata=meta, count_dataframe=fission_data)
```

## Normalizing and exploring data

There are lots of toys we have learned to use to play with with raw
data and explore stuff like batch effects or non-canonical
distributions or skewed counts.  hpgltools provides some functionality
to make this process easier.  The graphs shown below and many more are
generated with the wrapper 'graph_metrics()' but that takes away the
chance to explain the graphs as I generate them.

```{r norm_explore}
## First make a bar plot of the library sizes in the experiment.
## Notice that the colors were auto-chosen by create_expt() and they should
## be maintained throughout this process
fis_libsize <- plot_libsize(fission_expt)
fis_libsize$plot
## Here we see that the wild type replicate 3 sample for 15 minutes has fewer non-zero genes than all its friends.
fis_nonzero <- plot_nonzero(fission_expt, labels="boring", title="nonzero vs. cpm")
fis_nonzero$plot

fis_density <- plot_density(fission_expt)
fis_density$plot
knitr::kable(fis_density$condition_summary)
knitr::kable(fis_density$batch_summary)
knitr::kable(head(fis_density$sample_summary))
```

### An initial pca plot

In most cases, raw data does not cluster very well, lets see if that
is also true for the fission experiment. Assuming it doesn't, lets
normalize the data using the defaults (cpm, quantile, log2) and try
again.

```{r pca}
## Something in this is causing a build loop on travis...
## Unsurprisingly, the raw data doesn't cluster well at all...
##fis_rawpca <- plot_pca(fission_expt, expt_names=fission_expt$condition, cis=NULL)
fis_rawpca <- plot_pca(fission_expt)
fis_rawpca$plot
## So, normalize the data
norm_expt <- sm(normalize_expt(fission_expt, transform="log2", norm="quant", convert="cpm"))
## And try the pca again
fis_normpca <- plot_pca(norm_expt, plot_labels="normal", title="normalized pca", cis=NULL)
fis_normpca$plot
## Clearly time is an important factor in the data.
```

## Try something for Najib

```{r test_3d}
testing <- plot_pca(norm_expt, num_pc=3)
silly <- plot_3d_pca(testing)
silly$plot
```

In the final line of the preceeding block, I printed a summary of the return from plot_pca().
It contains the following information:

* pca:      The result from the fast.svd() call.
* plot:     The ggplot2 pca plot.
* table:    The metadata used to make the pca plot.
* res:      A table of the residual variance after each component by condition/batch.
* variance: A numeric list of the %variance remaining after each PC.

With that in mind, lets perform some more pca plots after normalizing the data and see how different
they look.

```{r normalized_pca}
normbatch_expt <- sm(normalize_expt(fission_expt, transform="log2", norm="quant",
                                    convert="cpm", batch="sva"))
fis_normbatchpca <- plot_pca(normbatch_expt,
                             title="Normalized PCA with batch effect correction.", cis=NULL)
fis_normbatchpca$plot
## ok, that caused the 0, 60, 15, and 30 minute samples to cluster nicely
## the 120 and 180 minute samples are still a bit tight

## pca_information provides some more information about the call to
## fast.svd that went into making the pca plot
fis_info <- pca_information(norm_expt,
                            expt_factors=c("condition","batch","strain","minute"),
                            num_components=6)
## The r^2 table shows that quite a lot of the variance in the data is explained by condition
knitr::kable(head(fis_info$rsquared_table))
## We can look at the correlation between the principle components and the factors in the experiment
## in this case looking at condition/batch vs the first 4 components.
knitr::kable(fis_info$pca_cor)
## And p-values to lend some credence(or not to those assertions)
knitr::kable(fis_info$anova_p)

## Try again with batch removed data
batchnorm_expt <- sm(normalize_expt(fission_expt, batch="limma", norm="quant",
                                    transform="log2", convert="cpm"))
fis_batchnormpca <- plot_pca(batchnorm_expt, plot_title="limma corrected pca")
fis_batchnormpca$plot
test_pca <- pca_information(batchnorm_expt,
                            expt_factors=c("condition","batch","strain","minute"),
                            num_components=6)
```

Interesting, the batch normalized pca plot looks much the same as the
normalized. The variances are in fact pretty much the exact same...

## Look at the data distributions

We have some tools which provide visualizations of the distribution of
the data:

```{r distributions}
fission_boxplot <- sm(plot_boxplot(fission_expt))
fission_boxplot
sf_expt <- sm(normalize_expt(fission_expt, norm="sf"))
fission_boxplot <- sm(plot_boxplot(sf_expt))
fission_boxplot
tm_expt <- sm(normalize_expt(fission_expt, norm="tmm"))
fission_boxplot <- sm(plot_boxplot(tm_expt))
fission_boxplot
rle_expt <- sm(normalize_expt(fission_expt, norm="rle"))
fission_boxplot <- sm(plot_boxplot(rle_expt))
fission_boxplot
up_expt <- sm(normalize_expt(fission_expt, norm="upperquartile"))
fission_boxplot <- sm(plot_boxplot(up_expt))
fission_boxplot

fission_density <- plot_density(norm_expt)
fission_density$plot
fission_density <- plot_density(sf_expt)
fission_density$plot
fission_density <- plot_density(tm_expt)
fission_density$plot

compare_12 <- plot_single_qq(fission_expt, x=1, y=2)
compare_12$log
```

## See how they cluster

Ok, so we can further check out how the data cluster with respect to
one another...

```{r clustering}
fission_cor <- plot_corheat(norm_expt)
fission_cor$plot
fission_cor <- plot_corheat(batchnorm_expt)
fission_cor$plot
fission_dis <- plot_disheat(norm_expt)
fission_dis$plot
fission_dis <- plot_disheat(batchnorm_expt)
fission_dis$plot
```

# variancePartition

variancePartition may be used to seek out which experimental factors correlate with
the most variance in the data.

```{r variancePartition}
test_varpart <- simple_varpart(fission_expt, predictor=NULL, factors=c("condition","batch"))
test_varpart$percent_plot
test_varpart$partition_plot

## Here, let us test the variance contributed by strain, time, and replicate.
test_varpart <- simple_varpart(fission_expt, predictor=NULL, factors=c("condition", "strain", "minute", "replicate"))
test_varpart$percent_plot
test_varpart$partition_plot
```

YAY!

```{r sysinfo, results='asis'}
pander::pander(sessionInfo())
```
